Red Wine Quality Exploration by Siwei Liu

This report explores a dataset of 1,599 red wines with 13 variables: 11 chemical properties of the wine, the quality rating given by experts, and the identifier variable X.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our dataset consists of 13 variables and 1,599 observations. The first variable, ‘X’, is the identifier of each red wine, and the last variable, quality, is the quality rating given by experts. All other variables are chemical attributes of the red wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The first thing that I wanted to explore is the distribution of the quality of red wines. According to the plot above, quality is roughly bell-shaped (approximately normal for a discrete score), with most wines rated 5 or 6.
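As a quick cross-check of the summary and the table above, the mean, the median and the share of wines rated 5 or 6 can be recomputed from the printed counts alone. This is a minimal Python sketch, not the code used to produce the report; the names `quality_counts` and `weighted_median` are illustrative.

```python
# Counts per quality score, copied from the table printed above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(quality_counts.values())
mean_quality = sum(score * n for score, n in quality_counts.items()) / total


def weighted_median(counts):
    """Return the middle value of the expanded sample (odd sample sizes only)."""
    n = sum(counts.values())
    middle = (n + 1) // 2  # 1-based position of the median
    running = 0
    for score in sorted(counts):
        running += counts[score]
        if running >= middle:
            return score


median_quality = weighted_median(quality_counts)
share_5_or_6 = (quality_counts[5] + quality_counts[6]) / total

print(total, round(mean_quality, 3), median_quality, round(share_5_or_6, 3))
```

The result should agree with the summary above (1599 observations, mean 5.636, median 6) and shows that roughly four out of five wines are rated 5 or 6.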

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

Above I plotted three variables that are related to acid: fixed acidity, volatile acidity and citric acid. The distributions of fixed acidity and volatile acidity are slightly right skewed. Citric acid is distributed differently from the other two: it has two obvious peaks, one at 0 (132 wines) and the other at 0.49 (68 wines).
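The peaks can be read directly off the frequency table printed above. The following Python sketch (an illustration, not the report's own code) parses that table as printed and lists the three most frequent citric.acid values; `parse_table` is a hypothetical helper name.

```python
# Frequency table of citric.acid as printed above: a row of values,
# then a row of counts, repeated.
TABLE = """
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
0.75 0.76 0.78 0.79 1
1 3 1 1 1
"""


def parse_table(text):
    """Map each citric.acid value to its count."""
    rows = [line.split() for line in text.strip().splitlines()]
    counts = {}
    # Even rows hold the values, odd rows hold the matching counts.
    for values, freqs in zip(rows[0::2], rows[1::2]):
        for value, freq in zip(values, freqs):
            counts[float(value)] = int(freq)
    return counts


citric_counts = parse_table(TABLE)
top_three = sorted(citric_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_three)
```

In the printed table the most frequent values are 0 (132 wines), 0.49 (68 wines) and 0.24 (51 wines).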

Next, since density depends on the percentage of alcohol and sugar in the red wine, I’m going to explore these three variables together.

According to the plots above, the density of the red wine is approximately normally distributed and the distribution of alcohol content is right skewed. Residual sugar has a long tail with quite distant outliers (median 2.2, maximum 15.5), so I’m going to apply a log transformation.

After the log transformation, the distribution looks more normal.
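To illustrate why a log transformation helps with a long right tail, the sketch below compares the skewness of residual sugar before and after taking logs. It is a Python illustration that uses only the ten residual.sugar values shown in the str() output above (an assumption made for brevity, not the full dataset). The helper name `skewness` and the choice of base 10 are also assumptions; skewness does not depend on the log base.

```python
import math

# First ten residual.sugar values from the str() output above (not the full data).
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]


def skewness(values):
    """Population (moment-based) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw_skew = skewness(residual_sugar)
log_skew = skewness([math.log10(v) for v in residual_sugar])

print(f"raw: {raw_skew:.2f}, log10: {log_skew:.2f}")
```

On these ten values the skewness drops after the transformation but stays positive, so the log scale reduces the right tail rather than removing it.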

Next, I will plot three other related variables: free sulfur dioxide, total sulfur dioxide and sulphates.

These three variables are closely related according to the provided text documentation, and the plot above shows that they are all strongly right skewed. I will apply a log transformation to see whether they then look more normally distributed.

After the transformation, total sulfur dioxide and sulphates in particular look a lot more normally distributed.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of red wine, with 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). X and quality are integers and all other variables are floating point numbers.

Other observations:

  • The red wine quality is approximately normally distributed, and most red wines here (1,319 of 1,599) have a quality of 5 or 6.

  • The density difference between the wines in our dataset is very small: the minimum density is 0.9901 and the maximum density is 1.0037.

  • About 75% of the red wines have a volatile acidity less than or equal to 0.6400 g / \(dm^3\).

  • Most red wines have free sulfur dioxide less than 60 mg / \(dm^3\).

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are the quality of the red wine and the chemical properties such as acidity, sugar, chlorides, sulfur dioxide and alcohol. I’d like to determine which chemical properties influence the quality of red wines.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Volatile acidity, citric acid, total sulfur dioxide, alcohol and some combination of the other variables can be helpful in building a predictive model of red wine quality. According to the text documentation provided with the red wine data, volatile acidity, citric acid and total sulfur dioxide affect the taste of the wine, hence I think these variables will contribute most to the quality rating of the red wine.

Did you create any new variables from existing variables in the dataset?

No. In the future, if a transformation or combination of the current variables turns out to be helpful in exploring the dataset, I will create some new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The distributions of total sulfur dioxide, free sulfur dioxide, sulphates and residual sugar are right skewed with some outliers, so I applied a log transformation to them. For residual sugar, I also cropped the data a little to get a better look at the majority of the data.

Bivariate Plots Section

##                           X fixed.acidity volatile.acidity citric.acid
## X                     1.000        -0.268           -0.009      -0.154
## fixed.acidity        -0.268         1.000           -0.256       0.672
## volatile.acidity     -0.009        -0.256            1.000      -0.552
## citric.acid          -0.154         0.672           -0.552       1.000
## residual.sugar       -0.031         0.115            0.002       0.144
## chlorides            -0.120         0.094            0.061       0.204
## free.sulfur.dioxide   0.090        -0.154           -0.011      -0.061
## total.sulfur.dioxide -0.118        -0.113            0.076       0.036
## density              -0.368         0.668            0.022       0.365
## pH                    0.136        -0.683            0.235      -0.542
## sulphates            -0.125         0.183           -0.261       0.313
## alcohol               0.245        -0.062           -0.202       0.110
## quality               0.066         0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## X                            -0.031    -0.120               0.090
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## X                                  -0.118  -0.368  0.136    -0.125   0.245
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## X                      0.066
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                1.000

From the correlation matrix above we can see that the variables most strongly correlated with red wine quality are: alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226).

Let’s use boxplots to take a closer look at these four variables that are most correlated with the quality of red wines.
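The ranking can be reproduced from the quality column of the matrix above by sorting on the absolute value of each correlation. The snippet below is a small Python sketch that works only from the printed (rounded) coefficients; the variable names `cor_with_quality`, `ranked` and `top_four` are illustrative.

```python
# Correlations with quality, copied from the last column of the matrix above.
# X (the row identifier) and quality itself are left out.
cor_with_quality = {
    "fixed.acidity": 0.124,
    "volatile.acidity": -0.391,
    "citric.acid": 0.226,
    "residual.sugar": 0.014,
    "chlorides": -0.129,
    "free.sulfur.dioxide": -0.051,
    "total.sulfur.dioxide": -0.185,
    "density": -0.175,
    "pH": -0.058,
    "sulphates": 0.251,
    "alcohol": 0.476,
}

# Rank by strength of association, ignoring the sign.
ranked = sorted(cor_with_quality.items(), key=lambda kv: abs(kv[1]), reverse=True)
top_four = [name for name, _ in ranked[:4]]

for name, r in ranked:
    print(f"{name:<22}{r:>7.3f}")
```

Sorting by absolute value matters here: volatile.acidity would rank last if the sign were kept, although it is the second strongest association. The next two variables after the top four are total.sulfur.dioxide (-0.185) and density (-0.175).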

First, let’s plot quality with alcohol:

It seems that, starting from quality 5, the more alcohol the red wines contain, the better their quality.

Secondly, let’s plot quality with volatile.acidity:

There is a very obvious pattern: the more volatile acid the red wines contain, the worse their quality. This agrees with the text documentation provided with the red wine data, ‘volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste’.

Thirdly, let’s take a look at the quality with sulphates:

According to the plot above, the higher the sulphates, the better the quality of the red wine, but the effect of sulphates seems less pronounced than that of the other two variables, alcohol and volatile acidity.

Finally, I’m going to plot quality with citric.acid:

According to the author’s documentation provided with the red wine data, ‘Citric acid can add “freshness” and flavor to wines’. Our plot is consistent with this: it shows that quality and citric acid are positively correlated.

Besides the four most correlated variables above, I also noticed that density and quality have a correlation of -0.175. This correlation is not as strong as those of the four variables that I analyzed above, but it is still noticeable compared with most of the other variables. I suspect that part of this is because density has a -0.496 correlation with alcohol, and alcohol is the variable most strongly correlated with quality.

It makes sense. The amount of alcohol is positively related with quality and density is negatively related with alcohol, that is why we see a negative correlation between density and quality.
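One way to put a number on this reasoning is the first-order partial correlation between density and quality with alcohol held fixed, r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The Python sketch below applies that standard formula to the rounded coefficients from the matrix above. It is an added illustration rather than part of the original analysis, and the function name `partial_correlation` is hypothetical.

```python
import math

# Pairwise correlations copied from the matrix above (rounded to three decimals).
r_density_quality = -0.175
r_density_alcohol = -0.496
r_alcohol_quality = 0.476


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_partial = partial_correlation(r_density_quality, r_density_alcohol, r_alcohol_quality)
print(round(r_partial, 3))
```

With these rounded inputs the partial correlation comes out near +0.08, much smaller in magnitude than -0.175. That is consistent with alcohol accounting for most of the negative association between density and quality, although the result is only approximate because the inputs are rounded.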

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

According to the plots and data above, the features most strongly associated with the quality of red wines are alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226). The amounts of alcohol, sulphates and citric acid are positively related to the quality of the wine, and the remaining variable, volatile acidity, is negatively related to it.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I noticed that density and quality have a correlation of -0.175. I suspect that part of this is because density has a -0.496 correlation with alcohol, and alcohol is the variable most strongly and positively correlated with quality.

What was the strongest relationship you found?

The strongest relationship I found with quality is that of alcohol (0.476); the two are positively correlated.

Multivariate Plots Section

I made 6 plots of the 4 variables that are most correlated with quality. In these plots, I used different colors to represent different quality levels. We can see that most high quality red wines have a high alcohol level and low volatile acidity. High quality red wines also have relatively high sulphates and citric acid.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The four most decisive variables (alcohol, volatile.acidity, sulphates and citric.acid) strengthen each other. All 6 plots seem to show that high quality wines tend to have high levels of alcohol, citric.acid and sulphates, and a low level of volatile.acidity.

Were there any interesting or surprising interactions between features?

According to the text documentation by the author, ‘volatile acidity at too high of levels can lead to an unpleasant, vinegar taste’. I noticed that red wines with a volatile acidity higher than 1.1 almost never get a high quality rating.


Final Plots and Summary

Plot One

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Description One

I chose the quality plot as my first plot because quality is the number one feature I care about in this research. According to the plot, most red wines have a quality of 5 or 6, few get a rating of 4, and even fewer get 3 or 8, which are the two extremes of red wine quality in this dataset. Also, I noticed that even though the quality rating ranges from 0 to 10 according to the author’s documentation, no red wine receives a rating lower than 3 or higher than 8. This may be due to how the quality rating is derived. According to the information provided by the author, at least 3 wine experts rated the quality of each wine. I guess the final result is based on the average or the median rating of the experts; since it’s unlikely that all of them give a full rating or a rating of 0, it makes sense that extreme ratings like 0 and 10 do not exist.
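The intuition that a combined rating rarely lands on an extreme can be checked with a toy calculation. The Python sketch below assumes three raters who each pick an integer from 0 to 10 independently and uniformly at random, and that the final score is their median. Both the uniform choice and the use of the median are assumptions for illustration only; the actual rating process is not specified here beyond ‘at least 3 wine experts’.

```python
from itertools import product
from statistics import median

scores = range(0, 11)  # rating scale 0..10 from the dataset documentation

# Enumerate every combination of three independent ratings (11 ** 3 cases).
combos = list(product(scores, repeat=3))
extreme = sum(1 for c in combos if median(c) in (0, 10))

p_single_extreme = 2 / len(scores)        # one rater alone gives 0 or 10
p_median_extreme = extreme / len(combos)  # median of three ratings is 0 or 10

print(f"single rater: {p_single_extreme:.3f}, median of three: {p_median_extreme:.3f}")
```

Even under this deliberately crude model, the median of three ratings is an extreme far less often than a single rating is (62 of 1,331 combinations). Real experts are unlikely to rate uniformly, so extremes would be rarer still.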

Plot Two

Description Two

For plot two, I chose the boxplots of quality against four variables: alcohol, volatile acidity, sulphates and citric acid, because these are the four variables most strongly associated with the quality of red wines. Firstly, wines with a higher alcohol level tend to have higher quality. Secondly, quality and volatile acidity are negatively correlated; however, there is no noticeable difference in volatile acidity between quality 7 and quality 8. Thirdly, quality and sulphates are positively correlated, but the correlation is not as strong as that between quality and alcohol. Finally, there is an obvious pattern that red wines with higher quality have a higher citric acid level.

Plot Three

Description Three

I chose this plot because it shows quality and the two most influential variables together in one plot. We can see that high quality red wines tend to have a high alcohol level, especially above 12% alcohol by volume. It also shows that high volatile acidity tends to prevent red wines from receiving a high quality rating.


Reflection

In this research I explored a dataset of 1,599 red wines with 13 variables: 11 chemical properties of the wine, the quality rating given by experts, and the identifier variable X. Through the exploration of one variable, two variables and three variables, I found the four variables most strongly associated with the quality of red wine: alcohol, volatile acidity, sulphates and citric acid. The plots and the correlation matrix show that alcohol, sulphates and citric acid are positively correlated with the quality of red wine, while volatile acidity is negatively correlated with it.

I think everything went smoothly except for the following two things: